The purpose of this case study is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
Four Corgi model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars; in this dataset the two car models are therefore merged into a single 'car' class.
Attribute Information:
COMPACTNESS: (average perimeter)**2 / area
CIRCULARITY: (average radius)**2 / area
DISTANCE CIRCULARITY: area / (average distance from border)**2
RADIUS RATIO: (max radius - min radius) / average radius
PR.AXIS ASPECT RATIO: (minor axis) / (major axis)
MAX.LENGTH ASPECT RATIO: (length perpendicular to max length) / (max length)
SCATTER RATIO: (inertia about minor axis) / (inertia about major axis)
ELONGATEDNESS: area / (shrink width)**2
PR.AXIS RECTANGULARITY: area / (pr.axis length * pr.axis width)
MAX.LENGTH RECTANGULARITY: area / (max length * length perpendicular to it)
SCALED VARIANCE ALONG MAJOR AXIS: (2nd order moment about minor axis) / area
SCALED VARIANCE ALONG MINOR AXIS: (2nd order moment about major axis) / area
SCALED RADIUS OF GYRATION: (mavar + mivar) / area
SKEWNESS ABOUT MAJOR AXIS: (3rd order moment about major axis) / sigma_min**3
SKEWNESS ABOUT MINOR AXIS: (3rd order moment about minor axis) / sigma_maj**3
KURTOSIS ABOUT MINOR AXIS: (4th order moment about major axis) / sigma_min**4
KURTOSIS ABOUT MAJOR AXIS: (4th order moment about minor axis) / sigma_maj**4
HOLLOWS RATIO: (area of hollows) / (area of bounding polygon)
where sigma_maj**2 is the variance along the major axis, sigma_min**2 is the variance along the minor axis, and
area of hollows = area of bounding polygon - area of object
The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object, oriented at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.
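To illustrate how a feature of this kind arises from a silhouette, here is a minimal sketch of the COMPACTNESS measure. The binary mask and the exposed-pixel-edge perimeter approximation are hypothetical simplifications for illustration, not the original extraction software:

```python
import numpy as np

def compactness(mask):
    """(perimeter**2) / area for a binary silhouette mask.

    Perimeter is approximated by counting pixel edges exposed to the
    background (a simplification for illustration only).
    """
    mask = np.pad(mask, 1).astype(bool)    # zero border so rolls cannot wrap onto the object
    area = mask.sum()
    perim = 0
    for axis in (0, 1):
        for shift in (1, -1):
            neighbour = np.roll(mask, shift, axis=axis)
            perim += np.count_nonzero(mask & ~neighbour)   # filled cell, empty neighbour
    return perim ** 2 / area

square = np.zeros((12, 12), dtype=int)
square[2:10, 2:10] = 1                     # an 8x8 filled square
print(compactness(square))                 # perimeter 32, area 64 -> 16.0
```

A circle minimizes this ratio, so higher values indicate less compact (more irregular or elongated) silhouettes.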
NUMBER OF CLASSES: 3 (CAR, BUS, VAN)
import numpy as np #import numpy
import pandas as pd #import pandas
import seaborn as sns # import seaborn
import matplotlib.pyplot as plt #import pyplot
from scipy.stats import pearsonr #for pearson's correlation
from sklearn.model_selection import train_test_split #for splitting the data in train and test
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler #for various scaling methods
from sklearn.linear_model import LogisticRegression #for LogisticRegression
from sklearn.naive_bayes import GaussianNB #for NaiveBayes
from sklearn.neighbors import KNeighborsClassifier #for KNN
from sklearn.svm import SVC #for Support vector classifier
from sklearn.tree import DecisionTreeClassifier #for decision tree classification
#from sklearn.feature_extraction.text import CountVectorizer #DT does not take strings as input for the model fit step....
from IPython.display import Image #for image
from sklearn import tree #for tree
from os import system #using user environment
from sklearn.ensemble import BaggingClassifier #for bagging classifier
from sklearn.ensemble import AdaBoostClassifier #for adaptive boosting
from sklearn.ensemble import GradientBoostingClassifier #for gradient boosting
from sklearn.ensemble import RandomForestClassifier #for random forest
from sklearn.preprocessing import LabelEncoder #for label encoder
from scipy.stats import zscore #for zscore
from sklearn.decomposition import PCA #for PCA
from sklearn.model_selection import KFold,cross_val_score #for cross validation
from sklearn.tree import export_graphviz #for exporting dot data
from io import StringIO #for StringIO (sklearn.externals.six was removed from scikit-learn; the standard library provides StringIO)
import pydotplus #for dot data
import graphviz #for visualizing decision tree
from statistics import median,mean #for median and mean functions
from sklearn.metrics import accuracy_score,confusion_matrix,recall_score #for accuracy metrics
from sklearn.metrics import precision_score,classification_report,roc_auc_score #for accuracy metrics
DataFrame = pd.read_csv('vehicle.csv',dtype={'class': 'category'}) #reading the CSV file
DataFrame.head(10) #to check head of the dataframe
DataFrame.tail() #to check tail of the dataframe
print('\033[1m''Number of rows in dataframe',DataFrame.shape[0]) #for number of rows
print('\033[1m''Number of features in dataframe',DataFrame.shape[1]) #for number of features
DataFrame.dtypes.to_frame('Datatypes of attributes').T #for datatypes of attributes
DataFrame.isnull().sum().to_frame('Presence of missing values').T #for checking presence of missing values
DataFrame.describe().T #for 5 point summary
col_names = DataFrame.columns.values.tolist() #column names
sns.set(context='notebook', style='whitegrid', palette='dark', font='sans-serif', font_scale=1.2, color_codes=True)
fig, axes = plt.subplots(nrows=9, ncols=2) #create subplots, 9 rows x 2 columns
count = 0
for i in range(9):
    for j in range(2):
        col = col_names[count + j]
        sns.distplot(DataFrame[col].values, ax=axes[i][j], bins=30, color="tab:cyan") #distplot is deprecated in newer seaborn; histplot(..., kde=True) is the replacement
        axes[i][j].set_title(col, fontsize=17)
    count = count + j + 1
fig = plt.gcf()
fig.set_size_inches(8, 20)
plt.tight_layout()
plot=sns.countplot(x=DataFrame['class'],data=DataFrame) #Countplot of 'class'
DataFrame['class'].value_counts().to_frame('Target column distribution') #Value counts of target column
DataFrame.skew().to_frame('Skewness measure').T #for measure of skewness
col_names = DataFrame.columns.values.tolist() #column names
fig, axes = plt.subplots(nrows=9, ncols=2) #create subplots, 9 rows x 2 columns
count = 0
for i in range(9):
    for j in range(2):
        col = col_names[count + j]
        sns.boxplot(DataFrame[col].values, ax=axes[i][j], color="tab:cyan")
        axes[i][j].set_title(col, fontsize=17)
    count = count + j + 1
fig = plt.gcf()
fig.set_size_inches(8, 20)
plt.tight_layout()
df_copy = DataFrame.copy() #making a copy of dataframe for preprocessing
encoder = LabelEncoder() #creating object of LabelEncoder
df_copy['class'] = encoder.fit_transform(df_copy['class']).astype(int) #encoding 'class' column
df_copy.head() #displaying head of encoded dataframe
df_copy['class'].dtype #for datatype
df_copy[['class']] = df_copy[['class']].apply(pd.Categorical)#changing datatype of attribute to categorical
df_copy['class'].dtype #for datatype
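For reference, LabelEncoder assigns integer codes in sorted (alphabetical) order of the class names, so the encoded column can be decoded back at any time. A quick round-trip on the three class names of this problem:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['car', 'bus', 'van', 'car'])   # sorted order: bus=0, car=1, van=2
print(list(codes))                                        # [1, 0, 2, 1]
print(list(enc.inverse_transform(codes)))                 # ['car', 'bus', 'van', 'car']
```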
#Imputing missing values in the numeric columns with the column mean
for col in ['circularity', 'distance_circularity', 'radius_ratio', 'pr.axis_aspect_ratio',
            'scatter_ratio', 'elongatedness', 'pr.axis_rectangularity', 'scaled_variance',
            'scaled_variance.1', 'scaled_radius_of_gyration', 'scaled_radius_of_gyration.1',
            'skewness_about', 'skewness_about.1', 'skewness_about.2']:
    df_copy[col].fillna(df_copy[col].mean(), inplace=True)
df_copy.isnull().sum().to_frame('Presence of missing values').T #for checking presence of missing values
df_copy.head(10) #check head of dataframe
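The per-column mean imputation above could also be expressed with scikit-learn's SimpleImputer, which records the fitted means and can re-apply them to unseen data. A minimal sketch on a toy column (the column name here is only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'radius_ratio': [1.0, np.nan, 3.0]})
imputer = SimpleImputer(strategy='mean')                  # learns column means on fit
toy[['radius_ratio']] = imputer.fit_transform(toy[['radius_ratio']])
print(toy['radius_ratio'].tolist())                       # [1.0, 2.0, 3.0]
```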
#meanradius_ratio = float(df_copy['radius_ratio'].mean()) #radius_ratio
#df_copy['radius_ratio'] = np.where(df_copy['radius_ratio'] >np.percentile(df_copy['radius_ratio'], 75), meanradius_ratio,df_copy['radius_ratio']) #replacing with mean
#Replacing outliers (values above the 75th percentile) with the column mean
for col in ['pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scaled_radius_of_gyration.1', 'skewness_about']:
    col_mean = float(df_copy[col].mean())
    df_copy[col] = np.where(df_copy[col] > np.percentile(df_copy[col], 75), col_mean, df_copy[col])
#Boxplots after handling outliers
col_names = df_copy.columns.values.tolist() #column names
fig, axes = plt.subplots(nrows=9, ncols=2) #create subplots, 9 rows x 2 columns
count = 0
for i in range(9):
    for j in range(2):
        col = col_names[count + j]
        sns.boxplot(df_copy[col].values, ax=axes[i][j], color="tab:cyan")
        axes[i][j].set_title(col, fontsize=17)
    count = count + j + 1
fig = plt.gcf()
fig.set_size_inches(8, 20)
plt.tight_layout()
sns.pairplot(df_copy)# pairplot of all features
plt.figure(figsize=(15,10)) #for adjusting figuresize
sns.heatmap(df_copy.corr(),annot=True) #for correlation plot
To reduce multicollinearity, some highly correlated features should be removed.
The columns 'compactness', 'circularity', 'distance_circularity', 'radius_ratio', 'scatter_ratio', 'pr.axis_rectangularity', 'max.length_rectangularity', 'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration', 'skewness_about.2' and 'pr.axis_aspect_ratio' are highly correlated (above 0.50) with one another.
pca_df=df_copy.copy() #Copy of preprocessed dataframe to be used in PCA
pca_df.head() #Head of dataframe
df_copy = df_copy.drop(['scaled_radius_of_gyration' ,'skewness_about.2','radius_ratio','distance_circularity', 'circularity', 'scatter_ratio','scaled_variance.1','pr.axis_rectangularity', 'max.length_rectangularity','scaled_variance'],axis=1) #Dropping
df_copy.head() #Head of updated dataframe
sns.pairplot(df_copy) #Pairplot of features
X = df_copy.drop('class',axis=1) #independent dimensions
y = df_copy['class'] #selecting target column
X = X.apply(zscore) #Scaling with zscore
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=1) #train test split in 70:30 ratio
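One caveat with applying zscore to the full dataset before the split is that the test rows influence the scaling statistics. Fitting the scaler on the training split only avoids this leakage; a sketch with StandardScaler on toy data (the arrays here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 1] * 5)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(Xtr)          # statistics come from the training split only
Xtr_s, Xte_s = scaler.transform(Xtr), scaler.transform(Xte)
print(Xtr_s.mean(axis=0).round(6))          # ~[0. 0.] on the training split
```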
NB = GaussianNB() #Instantiate the Gaussian Naive bayes
NB.fit(X_train,y_train) #Call the fit method of NB to train the model or to learn the parameters of model
y_predi = NB.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,y_predi)) #for confusion matrix
print('-'*30)
NB_accuracy = accuracy_score(y_test,y_predi)
print('Accuracy of Naive Bayes :{:.2f}'.format(NB_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,y_predi)) #for classification report
print('->'*63)
scores = cross_val_score(NB, X, y, cv=9, scoring='accuracy')#Evaluate a score by cross-validation
max_NB_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
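This notebook keeps scores.max() as the headline number, which is optimistic since it reports the single luckiest fold. The conventional summary of cross-validation is the mean with its standard deviation; a sketch on a stand-in dataset (iris, since vehicle.csv is an external file):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = load_iris(return_X_y=True)
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=10, scoring='accuracy')
print('accuracy: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))
```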
#SVC with rbf kernel, sweeping C over several values
for C in [0.01, 0.05, 0.5, 1]:
    svc = SVC(C=C, kernel='rbf', gamma='auto') #Instantiate SVC
    svc.fit(X_train, y_train) #Fit SVC to learn the model parameters
    predicted_svc = svc.predict(X_test) #Predict
    print('\033[1m' + '->'*63)
    print('\033[1m' + 'C =', C)
    print('Confusion Matrix\n', confusion_matrix(y_test, predicted_svc)) #for confusion matrix
    print('-'*30)
    SVC_accuracy = accuracy_score(y_test, predicted_svc) #for accuracy score
    print('Accuracy of SVC :', SVC_accuracy)
    print('-'*30)
    print('\n Classification Report\n', classification_report(y_test, predicted_svc)) #for classification report
    print('->'*63)
SVC_accuracy1 = SVC_accuracy #accuracy at C=1 (the last value swept), kept for later reference
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
#SVC with linear kernel, sweeping C over the same values
for C in [0.01, 0.05, 0.5, 1]:
    svc = SVC(C=C, kernel='linear') #Instantiate SVC
    svc.fit(X_train, y_train) #Fit SVC to learn the model parameters
    predicted_svc = svc.predict(X_test) #Predict
    print('\033[1m' + '->'*63)
    print('\033[1m' + 'C =', C)
    print('Confusion Matrix\n', confusion_matrix(y_test, predicted_svc)) #for confusion matrix
    print('-'*30)
    SVC_accuracy = accuracy_score(y_test, predicted_svc) #for accuracy score
    print('Accuracy of SVC :', SVC_accuracy)
    print('-'*30)
    print('\n Classification Report\n', classification_report(y_test, predicted_svc)) #for classification report
    print('->'*63)
scores = cross_val_score(svc, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
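Instead of re-fitting SVC by hand for each C and kernel, GridSearchCV runs the same sweep under cross-validation and records the best combination. A sketch on a stand-in dataset (iris):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
param_grid = {'C': [0.01, 0.05, 0.5, 1], 'kernel': ['rbf', 'linear']}
grid = GridSearchCV(SVC(gamma='auto'), param_grid, cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```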
knn = KNeighborsClassifier(n_neighbors = 3) #Instantiate KNN with k=3
knn.fit(X_train,y_train) #Call the fit method of KNN to train the model or to learn the parameters of model
y_predict = knn.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,y_predict)) #for confusion matrix (y_predict holds the KNN predictions)
print('-'*30)
KNN_accuracy = accuracy_score(y_test,y_predict)
print('Accuracy of KNN :{:.2f}'.format(KNN_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,y_predict)) #for classification report
print('->'*63)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_knn_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
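k=3 is one choice; a small cross-validated sweep over odd k values (odd to reduce voting ties) picks it more systematically. Again on a stand-in dataset (iris):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
cv_means = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X_demo, y_demo, cv=5).mean()
            for k in range(1, 16, 2)}
best_k = max(cv_means, key=cv_means.get)    # k with the highest mean CV accuracy
print(best_k, round(cv_means[best_k], 3))
```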
dTR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1) #Instantiate Decision Tree with max_depth
dTR.fit(X_train, y_train) #Call the fit method of DT to train the model or to learn the parameters of model
predicted_DTR = dTR.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_DTR)) #for confusion matrix
print('-'*30)
DTR_accuracy = accuracy_score(y_test,predicted_DTR)
print('Accuracy of Decision Tree with Regularization:{:.2f}'.format(DTR_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_DTR)) #for classification report
print('->'*63)
scores = cross_val_score(dTR, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_DTR_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
bagg = BaggingClassifier(base_estimator=dTR, n_estimators=500,random_state=1) #Instantiate Bagging Classifier
bagg = bagg.fit(X_train, y_train) #Call the fit method of Bagging classifier to train the model or to learn the parameters of model
predicted_BAG = bagg.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_BAG)) #for confusion matrix
print('-'*30)
BAG_accuracy = accuracy_score(y_test,predicted_BAG)
print('Accuracy of Bagging Classifier :{:.2f}'.format(BAG_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_BAG)) #for classification report
print('->'*63)
scores = cross_val_score(bagg, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_bagg_cross_nopca=scores.max() #selecting highest score
print(scores)#print Scores
Aboost = AdaBoostClassifier(n_estimators=50, random_state=1) #Instantiate Adaptive boosting Classifier
Aboost = Aboost.fit(X_train, y_train) #Call the fit method of Adaptive boosting Classifier to train the model or to learn the parameters of model
predicted_ADA = Aboost.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_ADA)) #for confusion matrix
print('-'*30)
ADA_accuracy = accuracy_score(y_test,predicted_ADA)
print('Accuracy of AdaBoost :{:.2f}'.format(ADA_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_ADA)) #for classification report
print('->'*63)
scores = cross_val_score(Aboost, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Aboost_cross_nopca=scores.max() #selecting highest score
print(scores)#print Scores
Gboost = GradientBoostingClassifier(n_estimators = 100,random_state=1) #Instantiate Gradient boosting Classifier
Gboost = Gboost.fit(X_train, y_train)#Call the fit method of Gradient boosting Classifier to train the model or to learn the parameters of model
predicted_GRAD = Gboost.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_GRAD)) #for confusion matrix
print('-'*30)
GRAD_accuracy = accuracy_score(y_test,predicted_GRAD)
print('Accuracy of Gradient Boosting :{:.2f}'.format(GRAD_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_GRAD)) #for classification report
print('->'*63)
scores = cross_val_score(Gboost, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Gboost_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
#n=100
Rforest = RandomForestClassifier(n_estimators = 100, random_state=1, max_features=3)#Instantiate Random Forest Classifier
Rforest = Rforest.fit(X_train, y_train) #Call the fit method of Random Forest Classifier to train the model or to learn the parameters of model
predicted_RAN = Rforest.predict(X_test) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test,predicted_RAN )) #for confusion matrix
print('-'*30)
RAN_accuracy = accuracy_score(y_test,predicted_RAN )
print('Accuracy of Random Forest :{:.2f}'.format(RAN_accuracy)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test,predicted_RAN )) #for classification report
print('->'*63)
scores = cross_val_score(Rforest, X, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Rforest_cross_nopca=scores.max()#selecting highest score
print(scores)#print Scores
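A fitted random forest also exposes feature_importances_, which would indicate which silhouette features drive the classification. A sketch on a stand-in dataset (iris):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_demo, y_demo)
for idx in np.argsort(rf.feature_importances_)[::-1]:   # most important first
    print(idx, round(rf.feature_importances_[idx], 3))
```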
X = pca_df.drop('class',axis=1) #independent dimensions
y = pca_df['class'] #selecting target column
Xscaled = X.apply(zscore) #Scaling with zscore
Xscaled.head() #head of scaled dataframe
pca = PCA() #PCA
pca.fit(Xscaled) #fit Scaled data into PCA
print(pca.explained_variance_) #Eigen Values
print(pca.components_) #Eigen vectors
print(pca.explained_variance_ratio_) #Percentage of variance explained
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center') #bar plot of Eigen Values vs Variation explained
plt.ylabel('Variation explained')#set y label
plt.xlabel('Eigen Value')# set x label
plt.show()
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_)) #step plot of Eigen Values vs Variation explained
plt.ylabel('Variation explained')#set y label
plt.xlabel('Eigen Value')# set x label
plt.show()
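Rather than reading the component count off the step plot, the smallest number of components reaching a variance threshold can be computed directly; PCA also accepts a fractional n_components for the same purpose. A sketch on a stand-in dataset (wine, 13 features):

```python
import numpy as np
from scipy.stats import zscore
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X_demo, _ = load_wine(return_X_y=True)
pca_demo = PCA().fit(zscore(X_demo))
cum = np.cumsum(pca_demo.explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.95) + 1)     # smallest k explaining >= 95% of variance
print(k)
# Equivalent shortcut: PCA(n_components=0.95) keeps exactly that many components.
```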
pcad = PCA(n_components=8) #8 features
pcad.fit(Xscaled) #Fit into PCA
print(pcad.components_) #Eigen vectors
print(pcad.explained_variance_ratio_) #Percentage of variance explained
Reduced_dimension = pcad.transform(Xscaled) #reduce dimensions to 8
sns.pairplot(pd.DataFrame(Reduced_dimension)) #pairplot of principal components
X_train1,X_test1,y_train1,y_test1 = train_test_split(Reduced_dimension,y,test_size=0.30,random_state=1) #train test split in 70:30 ratio
NB = GaussianNB() #Instantiate the Gaussian Naive bayes
NB.fit(X_train1,y_train1) #Call the fit method of NB to train the model or to learn the parameters of model
y_predi = NB.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,y_predi)) #for confusion matrix
print('-'*30)
NB_accuracyWithpca = accuracy_score(y_test1,y_predi)
print('Accuracy of Naive Bayes :{:.2f}'.format(NB_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,y_predi)) #for classification report
print('->'*63)
scores = cross_val_score(NB, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_NB_cross=scores.max()#selecting highest score
print(scores)#print Scores
svc1 = SVC(C=1, kernel='rbf', gamma='auto') #Instantiate SVC with C=1
svc1.fit(X_train1, y_train1) #Fit SVC to learn the model parameters
predicted_svc = svc1.predict(X_test1) #Predict
print('\033[1m' + '->'*63)
print('\033[1m' + 'Confusion Matrix\n', confusion_matrix(y_test1, predicted_svc)) #for confusion matrix
print('-'*30)
SVC_accuracyWithpca1 = accuracy_score(y_test1, predicted_svc) #for accuracy score
print('Accuracy of SVC :', SVC_accuracyWithpca1)
print('-'*30)
print('\n Classification Report\n', classification_report(y_test1, predicted_svc)) #for classification report
print('->'*63)
#SVC with rbf kernel on the PCA-reduced data, sweeping the remaining values of C
for C in [0.01, 0.05, 0.5]:
    svc = SVC(C=C, kernel='rbf', gamma='auto') #Instantiate SVC
    svc.fit(X_train1, y_train1) #Fit SVC
    predicted_svc = svc.predict(X_test1) #Predict
    print('\033[1m' + '->'*63)
    print('\033[1m' + 'C =', C)
    print('Confusion Matrix\n', confusion_matrix(y_test1, predicted_svc)) #for confusion matrix
    print('-'*30)
    SVC_accuracyWithpca = accuracy_score(y_test1, predicted_svc) #for accuracy score
    print('Accuracy of SVC :', SVC_accuracyWithpca)
    print('-'*30)
    print('\n Classification Report\n', classification_report(y_test1, predicted_svc)) #for classification report
    print('->'*63)
scores = cross_val_score(svc, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross=scores.max()#selecting highest score
print(scores)#print Scores
#SVC with linear kernel on the PCA-reduced data, sweeping C
for C in [0.01, 0.05, 0.5, 1]:
    svc = SVC(C=C, kernel='linear') #Instantiate SVC
    svc.fit(X_train1, y_train1) #Fit SVC to learn the model parameters
    predicted_svc = svc.predict(X_test1) #Predict
    print('\033[1m' + '->'*63)
    print('\033[1m' + 'C =', C)
    print('Confusion Matrix\n', confusion_matrix(y_test1, predicted_svc)) #for confusion matrix
    print('-'*30)
    SVC_accuracyWithpca = accuracy_score(y_test1, predicted_svc) #for accuracy score
    print('Accuracy of SVC :', SVC_accuracyWithpca)
    print('-'*30)
    print('\n Classification Report\n', classification_report(y_test1, predicted_svc)) #for classification report
    print('->'*63)
scores = cross_val_score(svc, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_svc_cross=scores.max()#selecting highest score
print(scores)#print Scores
knn = KNeighborsClassifier(n_neighbors = 3) #Instantiate KNN with k=3
knn.fit(X_train1,y_train1) #Call the fit method of KNN to train the model or to learn the parameters of model
y_predict = knn.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,y_predict)) #for confusion matrix (y_predict holds the KNN predictions)
print('-'*30)
KNN_accuracyWithpca = accuracy_score(y_test1,y_predict)
print('Accuracy of KNN :{:.2f}'.format(KNN_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,y_predict)) #for classification report
print('->'*63)
scores = cross_val_score(knn, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_knn_cross=scores.max() #selecting highest score
print(scores)#print Scores
dTR = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state=1) #Instantiate Decision Tree with max_depth
dTR.fit(X_train1, y_train1) #Call the fit method of DT to train the model or to learn the parameters of model
predicted_DTR = dTR.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_DTR)) #for confusion matrix
print('-'*30)
DTR_accuracyWithpca = accuracy_score(y_test1,predicted_DTR)
print('Accuracy of Decision Tree with Regularization:{:.2f}'.format(DTR_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_DTR)) #for classification report
print('->'*63)
scores = cross_val_score(dTR, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_DTR_cross=scores.max()#selecting highest score
print(scores)#print Scores
bagg = BaggingClassifier(base_estimator=dTR, n_estimators=500,random_state=1) #Instantiate Bagging Classifier
bagg = bagg.fit(X_train1, y_train1) #Call the fit method of Bagging classifier to train the model or to learn the parameters of model
predicted_BAG = bagg.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_BAG)) #for confusion matrix
print('-'*30)
BAG_accuracyWithpca = accuracy_score(y_test1,predicted_BAG)
print('Accuracy of Bagging Classifier :{:.2f}'.format(BAG_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_BAG)) #for classification report
print('->'*63)
scores = cross_val_score(bagg, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_bagg_cross=scores.max()#selecting highest score
print(scores)#print Scores
Aboost = AdaBoostClassifier(n_estimators=50, random_state=1) #Instantiate Adaptive boosting Classifier
Aboost = Aboost.fit(X_train1, y_train1) #Call the fit method of Adaptive boosting Classifier to train the model or to learn the parameters of model
predicted_ADA = Aboost.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_ADA)) #for confusion matrix
print('-'*30)
ADA_accuracyWithpca = accuracy_score(y_test1,predicted_ADA)
print('Accuracy of Adaptive Boosting :{:.2f}'.format(ADA_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_ADA)) #for classification report
print('->'*63)
scores = cross_val_score(Aboost, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Aboost_cross=scores.max() #selecting highest score
print(scores)#print Scores
Gboost = GradientBoostingClassifier(n_estimators = 100,random_state=1) #Instantiate Gradient boosting Classifier
Gboost = Gboost.fit(X_train1, y_train1)#Call the fit method of Gradient boosting Classifier to train the model or to learn the parameters of model
predicted_GRAD = Gboost.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_GRAD)) #for confusion matrix
print('-'*30)
GRAD_accuracyWithpca = accuracy_score(y_test1,predicted_GRAD)
print('Accuracy of Gradient Boosting :{:.2f}'.format(GRAD_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_GRAD)) #for classification report
print('->'*63)
scores = cross_val_score(Gboost, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Gboost_cross=scores.max()#selecting highest score
print(scores)#print Scores
#n=100
Rforest = RandomForestClassifier(n_estimators = 100, random_state=1, max_features=3)#Instantiate Random Forest Classifier
Rforest = Rforest.fit(X_train1, y_train1) #Call the fit method of Random Forest Classifier to train the model or to learn the parameters of model
predicted_RAN = Rforest.predict(X_test1) #Predict
print('\033[1m''->'*63)
print('\033[1m''Confusion Matrix\n',confusion_matrix(y_test1,predicted_RAN )) #for confusion matrix
print('-'*30)
RAN_accuracyWithpca = accuracy_score(y_test1,predicted_RAN )
print('Accuracy of Random Forest :{:.2f}'.format(RAN_accuracyWithpca)) #for accuracy score
print('-'*30)
print('\n Classification Report\n',classification_report(y_test1,predicted_RAN )) #for classification report
print('->'*63)
scores = cross_val_score(Rforest, Xscaled, y, cv=10, scoring='accuracy')#Evaluate a score by cross-validation
max_Rforest_cross=scores.max() #selecting highest score
print(scores)#print Scores
Scores = [('Naive bayes', NB_accuracy,NB_accuracyWithpca,max_NB_cross_nopca,max_NB_cross),
('KNN', KNN_accuracy,KNN_accuracyWithpca,max_knn_cross_nopca,max_knn_cross),
('SVC', SVC_accuracy1,SVC_accuracyWithpca1,max_svc_cross_nopca,max_svc_cross),
('Decision Tree with Regularization',DTR_accuracy,DTR_accuracyWithpca,max_DTR_cross_nopca,max_DTR_cross),
('Bagging',BAG_accuracy,BAG_accuracyWithpca,max_bagg_cross_nopca,max_bagg_cross),
('Adaptive Boosting',ADA_accuracy,ADA_accuracyWithpca,max_Aboost_cross_nopca,max_Aboost_cross),
('Gradient Boosting',GRAD_accuracy,GRAD_accuracyWithpca,max_Gboost_cross_nopca,max_Gboost_cross),
('Random Forest N=100',RAN_accuracy,RAN_accuracyWithpca,max_Rforest_cross_nopca,max_Rforest_cross)] #List of accuracy scores of all models
Scores = pd.DataFrame(Scores,columns=['Model','Accuracy score without PCA and reduced dimensions','Accuracy score with PCA and reduced dimensions','Maximum Accuracy with cross validation without PCA and reduced dimensions','Maximum Accuracy with cross validation and PCA and reduced dimensions']) #Conversion of list to dataframe
Sorted=Scores.sort_values(by='Accuracy score with PCA and reduced dimensions',ascending=True) #Sort models by accuracy in ascending order
Sorted
ax = Sorted.plot(x='Model', y='Accuracy score without PCA and reduced dimensions', legend=False,rot=90)
ax2 = ax.twinx()
Sorted.plot(x='Model', y='Accuracy score with PCA and reduced dimensions', ax=ax2, legend=False, color="r")
ax.figure.legend()
ax.legend(bbox_to_anchor=(1.1,0), bbox_transform=ax.figure.transFigure)
plt.show()
ax = Sorted.plot(x='Model', y='Maximum Accuracy with cross validation without PCA and reduced dimensions', legend=False,rot=90)
ax2 = ax.twinx()
Sorted.plot(x='Model', y='Maximum Accuracy with cross validation and PCA and reduced dimensions', ax=ax2, legend=False, color="r")
ax.figure.legend()
ax.legend(bbox_to_anchor=(1.4,0), bbox_transform=ax.figure.transFigure)
plt.show()
For SVC without PCA and reduced dimensions, the best hyperparameters in terms of precision and recall are C=1 and kernel='rbf'.
With cross-validation, the accuracy scores of all models increased significantly; in one fold, Gradient Boosting reached 100% accuracy.
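The best SVC hyperparameters quoted above (C=1, kernel='rbf') would typically be found with a grid search over a small parameter grid. A minimal sketch of that search, on synthetic stand-in data (the grid values here are illustrative assumptions, not the notebook's actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the scaled vehicle features
X, y = make_classification(n_samples=300, n_features=18, n_classes=3,
                           n_informative=6, random_state=1)
X = StandardScaler().fit_transform(X)

# Candidate hyperparameters (illustrative grid)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

# Exhaustive search with 5-fold cross-validation on each combination
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print('Best parameters:', grid.best_params_)
print('Best CV accuracy: {:.2f}'.format(grid.best_score_))
```

`best_params_` holds the combination with the highest mean cross-validated score, and `best_estimator_` is an SVC refit on the full data with those settings.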